I like to contrast traditional case-based testing with property-based testing like this:
With both approaches, you want to show that certain claims hold about the software you are testing. With case-based testing, the claims are implicit (implied by your cases), and the cases are explicit. With property-based testing, the claims are explicit (your property specifications) and the cases are implicit (being some fuzzed subset of the claim's domain).
Case-based testing gives a formal proof of an existential property, i.e. "these assertions hold for some inputs" (the proof is some explicit example where the assertions hold)
Property-based testing gives approximate checks of a universal property, i.e. "these assertions hold for all inputs" (the check is a double-negative: are we unable to find a counter-example?)
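A minimal sketch of that contrast using Python's Hypothesis library (the sorting example is mine, purely illustrative):

    from collections import Counter

    from hypothesis import given, strategies as st

    # Case-based: an explicit case, an implicit claim ("sorting works").
    # It is a formal proof of the existential statement "these assertions
    # hold for some input" -- namely this one.
    def test_sort_example():
        assert sorted([3, 1, 2]) == [1, 2, 3]

    # Property-based: an explicit universal claim, implicit (generated) cases.
    # Hypothesis searches for a counter-example; passing means it could not
    # find one, not that none exists.
    @given(st.lists(st.integers()))
    def test_sort_property(xs):
        ys = sorted(xs)
        assert all(a <= b for a, b in zip(ys, ys[1:]))  # output is ordered
        assert Counter(ys) == Counter(xs)               # and is a permutation of the input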
I've found that many developers dismiss more sophisticated testing/checking approaches as being 'too academic' (e.g. "we don't need formal proofs"), 'too limited', 'impractical for real problems', etc. I like the above characterisation, since it turns that on its head: we can legitimately argue that case-based testing is too limited (it can only express existential properties, when we usually want universal ones) and also too academic (it gives formal proofs, which "we don't need").
Which would be great if it weren't too computationally expensive. Most of the time it just comes down to generating a random (within range) subset of input values to test, which decreases the utility.
We need better-structured code, such that each test can be tied to specific code: if the code doesn't change, the test doesn't re-run.
That is way too hard and too risky with current tech, and unit testing doesn't solve the problem.
> Which would be great if it wasn't too computationally expensive. Most of the time, it just becomes generating a subset of random (within range) input values to be tested thus decreasing the utility.
Those are complaints about the testing/checking process; I was talking about test/property definitions.
In particular, most of the properties we care about are universal ("for all"), but unit tests cannot even state those properties, let alone test them, let alone prove them. Property-based testing lets us state universal properties, and it lets us test them (although it cannot prove them). Those are two advantages compared to unit/functional tests.
Regarding the actual checking process, and the cost/benefit it provides, that varies depending on the project's needs, the tools/frameworks used (e.g. random generation versus enumeration versus on-demand generation, etc.), and the way testing is implemented (e.g. constructive generation versus filtering, domain-specific types versus general-purpose language builtins, etc.).
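To make the "constructive generation versus filtering" distinction concrete, here is a minimal Hypothesis sketch (the even-integers example is mine, purely illustrative):

    from hypothesis import strategies as st

    # Filtering: generate values, then discard any that fail the condition.
    # Easy to write, but wasteful (or unusable) when the condition rejects
    # most of what the underlying strategy produces.
    even_by_filtering = st.integers().filter(lambda n: n % 2 == 0)

    # Constructive generation: build values that satisfy the condition by
    # construction, so nothing is thrown away.
    even_by_construction = st.integers().map(lambda n: 2 * n)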
Personally, property-based testing is my default approach (since "normal" unit/functional tests are simply parameterless property tests); I've done it in Haskell, Javascript, Python, Scala, Isabelle/HOL and PHP; and it's found all sorts of issues with my code that I would never have thought to write tests for.
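One way to read that "parameterless property tests" remark, as a Hypothesis sketch (my illustration): a unit test is a property test whose strategy can only produce a single value.

    from hypothesis import given, strategies as st

    # A property test whose strategy yields exactly one value degenerates
    # into an ordinary example-based unit test.
    @given(st.just([3, 1, 2]))
    def test_sort_single_case(xs):
        assert sorted(xs) == [1, 2, 3]

    # Widening the strategy turns the same test shape into a universal claim.
    @given(st.lists(st.integers()))
    def test_sort_idempotent(xs):
        assert sorted(sorted(xs)) == sorted(xs)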
We use A LOT of Property-Based Testing at Auxon. It's both a fantastically practical addition to existing testing methods and a pretty viable on-ramp to the way of thinking about your software/systems that more "formal" verification approaches require of you. Earlier this year one of my teammates wrote a somewhat deeper dive into the practice, rather than the premise, of PBT that I thought was quite good. Granted I'm almost certainly biased, but in case anyone wants some further context and caveats from someone who's extensively used PBT "in anger" in multiple programming languages, I think it's worth taking a look: https://blog.auxon.io/2021/02/01/effective-property-based-te...
I generally conflate fuzzing and property-based testing, even though I nearly always mean property-based testing. This is lazy, and is lazily justified with "fuzzing is pbt with very generic inputs / pbt is fuzzing with more intelligent inputs".
Regardless, the hard part with pbt is generating the inputs so that they cover real cases and don't systematically miss some (this is often addressed by choosing different testing boundaries). The hard part with fuzzing is knowing you've done enough to have hit many real data flows.
My understanding is that while the how is similar (both throw mountains of generated input at the system), there's a qualitative difference: pbt asserts that some fact holds of the output, while fuzzing "just" tries to make it crash. Put another way, fuzzing might be described as testing the property that "the system under test doesn't crash". In practice fuzz testing is a special enough case that it makes sense to break it out into its own thing with its own tools.
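That "doesn't crash" framing can be written down directly as a property. A minimal sketch, where parse_config and its module are hypothetical stand-ins for whatever system is under test:

    from hypothesis import given, strategies as st

    from myapp.config import parse_config  # hypothetical system under test

    # Fuzzing phrased as a property: for every input the parser may reject
    # it, but it must never crash with an unexpected exception.
    @given(st.text())
    def test_parser_never_crashes(text):
        try:
            parse_config(text)
        except ValueError:
            pass  # rejecting bad input is fine; any other exception fails the test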
I think there's another difference between the two that isn't there by definition but manifests in practice:
If we were to imagine a simplified landscape in which all tests fall on a one-dimensional axis, from "quick unit tests" on the left to "slow integration tests" on the right, fuzzing is even further off to the right.
Fuzzing is often very slow and operates on an entire black-box program, trying to understand and break it. Access to source code isn't always needed.
My understanding may be incomplete; please correct me if I missed something.
There's a lot of overlap. In fuzzing you generally generate inputs that exercise the underlying (hidden) branching conditions. Some explicit, some implicit.
In PBT you generate inputs that exercise branching and value range conditions for a known subset of target variables or other internal states. The targets being tested are explicit.
Or, to quote David on his visit to our office: "Hypothesis is basically a python bytecode fuzzer."
My take: don't shy away from the logic perspective of what properties are, because it's the simplest explanation. A property is a logical proposition, e.g. "p implies q". Property based testing randomizes values of p and q as a proxy for all possible values, since exhaustive testing is practically impossible in almost all cases. The hypothesis is that running these tests for long enough will eventually expose a bug, since more and more of the input space gets generated over time.
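A sketch of that "p implies q" shape in Hypothesis, where assume() encodes the antecedent and the assertion the consequent (the abs() property is just my stand-in proposition):

    from hypothesis import assume, given, strategies as st

    # Property: for all integers n, (n < 0) implies (abs(n) == -n).
    # assume() discards generated cases where the antecedent p is false,
    # so the assertion only has to hold when p holds.
    @given(st.integers())
    def test_abs_of_negative(n):
        assume(n < 0)        # p
        assert abs(n) == -n  # q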
PBT is an addictive tool. I used it on a personal project involving a lot of date filtering, and it found so many edge cases related to date parsing and timezone handling. In my experience, you also end up writing way less test code, since the test library is generating the test cases for you.
The two major downsides are the difficulty in expressing useful properties, and test runtime. Regarding expressing properties, it is difficult, but I wonder if it can't be improved with practice. Hillel Wayne wrote a bit about properties here: https://www.hillelwayne.com/contract-examples/
Regarding the test runtime - it's just something to be wary of. You can easily start running tens of thousands of test cases depending on how you configure the library. It makes things like testing against a real database really impractical, unless you really control which inputs get generated, which defeats the purpose of random data generation.
I think parallelization and non-blocking test suites are the answer here. Locally and in CI, they can be run with a reasonable N so that obvious errors hopefully get caught, but large N's get saved for an async test suite that can benefit from high parallelization.
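With Hypothesis specifically, that "reasonable N locally, large N elsewhere" split is usually handled with settings profiles; a sketch (the profile names and example counts are my own):

    # conftest.py
    from hypothesis import settings

    # Small N for local runs and ordinary CI; a big N reserved for a
    # scheduled, highly parallel soak suite.
    settings.register_profile("dev", max_examples=50)
    settings.register_profile("ci", max_examples=500)
    settings.register_profile("soak", max_examples=50_000)

    # Pick one at import time (or via pytest's --hypothesis-profile flag).
    settings.load_profile("dev")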
My last point: even with all of PBT's benefits, example-based tests are still easier to write and still serve their purpose. They also allow you to test specific edge cases explicitly.
Like everything else, PBT is just another tool in the shed.
> Regarding the test runtime - it's just something to be wary of. You can easily start running tens of thousands of test cases depending on how you configure the library. It makes things like testing against a real database really impractical, unless you really control which inputs get generated, which defeats the purpose of random data generation.
Whether that's practical or not depends on your tooling and your resources. If you're willing to spend US$100k on a test machine, and some people are, you can have a 4096-node test cluster.
The article separates types from property testing. I can sympathize: it is true, and it makes sense.
But testing is a tool for reliability to be used in conjunction with so many other tools like, you guessed it, typing.
In 2016, when this article was written, typing wasn't yet popular in Python. But now it is, and indeed more popular than property testing will ever be.
But Hypothesis doesn't support typing very well, though it does make an attempt [0]. Just take a look at that list of caveats; it is a hack job in comparison.
Even if you don't generate your Strategies from types, Hypothesis requires you to know whether the type you construct is internally a sum type or a product type. [1] The testing framework shouldn't have to know these details, this is unnecessary complexity.
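For readers unfamiliar with what that sum/product distinction looks like in Hypothesis, a hedged sketch (the Point/Shape types are made up for illustration):

    from dataclasses import dataclass
    from typing import Union

    from hypothesis import strategies as st

    @dataclass
    class Point:               # a product type: an int AND an int
        x: int
        y: int

    Shape = Union[Point, str]  # a sum type: a Point OR a name

    # Product types: st.builds combines one strategy per field.
    points = st.builds(Point, x=st.integers(), y=st.integers())

    # Sum types: st.one_of chooses between the alternatives' strategies --
    # the test author has to know which shape the type has.
    shapes = st.one_of(points, st.text())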
If you think the QuickCheck strategy of having lawless unnamed instances is bad, there is now a Haskell library[2] where Gens are values, like they are in Hypothesis (where they are called Strategies). It probably became popular only after 2016, which could be why the article doesn't mention it. Hypothesis also supports "registering" strategies so that they are auto-discovered on-demand. It's done at runtime of course, unlike with QuickCheck, where it is type class based, as previously mentioned.
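The registration mechanism mentioned here looks roughly like this (a sketch; Money is a made-up type):

    from dataclasses import dataclass

    from hypothesis import given, strategies as st

    @dataclass
    class Money:
        cents: int
        currency: str

    # Register, at runtime, how to build a Money...
    st.register_type_strategy(
        Money,
        st.builds(Money,
                  cents=st.integers(min_value=0),
                  currency=st.sampled_from(["EUR", "USD", "GBP"])),
    )

    # ...after which it is discovered on demand, e.g. via st.from_type.
    @given(st.from_type(Money))
    def test_money_is_well_formed(m):
        assert m.cents >= 0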
I applaud the article for cementing that property checking can work without typing, but I worry that readers will conclude that Haskell does it wrong, which is a false implication. You don't have to use ad-hoc polymorphism for discovering strategies.
My experience of property based testing is that you end up either testing trivial stuff that you don't really care about (sure, maybe you have a bug with extremely large integers that can't actually happen in real life), or you're creating tests every bit as complex and buggy as the implementation itself to represent realistic states. Quite often you're just directly duplicating the logic you're testing to prove it works in all cases.
These just seem like the worst kinds of test, highly coupled with implementation details. For the benefit of finding some bugs with edge cases (which may or may not be spurious), I just don't believe it's worth the effort, given that you still need all your normal tests anyway. Maybe I am just unlucky not to work in a more pure, mathsy domain, I dunno.
My experience with PBT is in testing compilers. In this problem domain, it finds bugs that would be near impossible to find with traditional testing approaches. And the test generator can be made independent of the implementation, if one is testing standard language features.
I have applied it to testing Common Lisp implementations, but there has been much work on applying it to other languages (most famously C and Javascript.)
I applied the following techniques:
(1) Generate random, valid, well-defined programs and see (a) if they crash the compiler, (b) if they cause different CL implementations to produce different outputs, and (c) if, when modified (by adding randomly generated optimization directives or valid type declarations), they still produce the same output. This is differential testing, which was used by McKeeman at DEC in the 1990s to test C compilers, and later improved (again, on C compilers) by Yang, Chen, Eide, and Regehr at the University of Utah (csmith, creduce).
Since a running lisp image can generate and compile functions internally (this IS lisp, after all), the testing loop can be very fast. Since 2003 I have run this on and off for many billions of iterations on desktop and laptop machines on various CL implementations, now mainly on SBCL. Most of the test input reduction is handled automatically, which is a big help.
(2) Generate random possibly invalid code by mutating or recombining snippets drawn from a large corpus of code, to see if it crashes the compiler (in CL implementations where the compiler is promised to never respond to bad code by signaling an error.) This was also the approach jsfunfuzz took on Javascript.
(3) Extensive fuzzing of calls to standard functions in CL, using random generation of input values and random generation of valid type declarations, with the invariant being that the same values should be computed (and the compiler not fail.) This is a specialization of (1), but was sufficiently different that the bugs it found were not the same.
The experience with this sort of testing of compilers (in any language) is that if the compiler (free or commercial) has never been subjected to it, it will immediately find bugs in the compiler.
Out of curiosity, how do you go about generating "random valid well defined programs"? I thought about doing something like this for testing my own toy language, but wasn't sure how to go about generating random test cases.
Basically by a top down approach. For example, to generate a random integer valued form, pick one of N fixed rules and invoke it. A rule for a form of size S might be "generate some value k in [1,S-1] and then partitions S-1 into k integers, generate an integer valued form for each, then produce the form that is their sum."
At the leaf nodes, generate constants or variables. Keep track of the variables that are in scope as you do this (and their types.)
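A toy version of that top-down, size-budgeted generation, sketched in Python rather than Lisp (the rules and names are mine, just to show the shape):

    import random

    def gen_int_form(size, env):
        """Generate a random integer-valued s-expression of roughly `size`
        nodes, using only variables currently in scope (`env`)."""
        if size <= 1:
            # Leaf: an in-scope variable or a constant.
            if env and random.random() < 0.5:
                return random.choice(env)
            return str(random.randint(-100, 100))
        rule = random.choice(["sum", "product", "negate"])
        if rule == "negate":
            return "(- %s)" % gen_int_form(size - 1, env)
        # Split the remaining size budget among k sub-forms.
        k = random.randint(2, max(2, min(4, size - 1)))
        budgets = [1] * k
        for _ in range(max(0, size - 1 - k)):
            budgets[random.randrange(k)] += 1
        op = "+" if rule == "sum" else "*"
        return "(%s %s)" % (op, " ".join(gen_int_form(b, env) for b in budgets))

    # e.g. gen_int_form(10, ["x", "y"]) might yield "(+ x (* 3 y) -7)"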
This is a pretty standard approach used in compiler testing. In languages like C you have problems ensuring the program doesn't cause undefined behavior, but Common Lisp is much nicer in that respect -- for example, evaluation order of arguments to functions is defined, and you don't have to worry about integer operations overflowing.
All this has made me appreciate language specifications that really nail down what programs are supposed to do.
The forms generated are constrained if one wants them to be easily reducible while preserving validity. Hypothesis has a nice solution to that problem, which I didn't use, as this was done a decade and a half earlier.
I can see how you might avoid "highly coupled with implementation details" example-based tests. With a little abstraction, you can do the same with property-based tests.
Think of properties as external assertions that you code up as a test. An easy example is the contract of a stack, the properties that result from it, and how you could test those properties with arbitrary examples: pushing an arbitrary t of T onto a stack of T results in a stack one deeper; popping a non-empty stack results in a stack one shallower; etc. (sketched below).
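A minimal sketch of those stack properties with Hypothesis, using a plain Python list as the stack (my simplification):

    from hypothesis import given, strategies as st

    stacks = st.lists(st.integers())  # model a stack of ints as a list

    # Pushing any t onto any stack makes it exactly one deeper, and a
    # subsequent pop returns that t and restores the original depth.
    @given(stacks, st.integers())
    def test_push_then_pop(stack, t):
        before = len(stack)
        stack.append(t)                   # push
        assert len(stack) == before + 1   # one deeper
        assert stack.pop() == t           # pop returns what was pushed
        assert len(stack) == before       # one shallower again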
> how do you use PBT to show that some operation is thread-safe?
One approach is to make values to represent the "actions" we care about (e.g. "increment", "sendMessage", "delete", etc.), and write tests which take random lists of actions as arguments and execute those actions.
We can calculate the expected result by either using a non-threaded 'oracle', or by generating actions with a known result (e.g. if we always generate equal numbers of "increment" and "decrement" actions, then the result should match the original value).
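A sketch of that action-list-plus-oracle approach (SafeCounter here is my stand-in for whatever thread-safe component is actually under test):

    import threading
    from concurrent.futures import ThreadPoolExecutor

    from hypothesis import given, settings, strategies as st

    class SafeCounter:  # stand-in system under test
        def __init__(self):
            self.value, self._lock = 0, threading.Lock()
        def apply(self, action):
            with self._lock:
                self.value += 1 if action == "increment" else -1

    actions = st.lists(st.sampled_from(["increment", "decrement"]))

    # Threaded timings are noisy, so turn off Hypothesis's per-example deadline.
    @settings(deadline=None)
    @given(actions)
    def test_counter_matches_sequential_oracle(acts):
        counter = SafeCounter()
        with ThreadPoolExecutor(max_workers=4) as pool:
            list(pool.map(counter.apply, acts))  # execute the actions concurrently
        # Non-threaded oracle: just count increments and decrements.
        assert counter.value == acts.count("increment") - acts.count("decrement")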
As an example, I wrote a dynamic binding mechanism based on thread-local storage, which required special handling of thread pools to ensure futures wouldn't see stale values. To test this, I made a tiny domain-specific language (AKA a handful of classes) defining futures/awaiting, dynamic binding/lookup, and string literals/appending. I made two interpreters (AKA recursive functions) for this DSL: one using real futures/threads/dynamic-binding, the other just a straightforward calculation with a couple of stacks. To test the correctness of my dynamic binding mechanism in the face of futures/threading, I simply test that randomly-generated programs in that DSL will give the same result from both interpreters. (I also wrote a few properties to test each interpreter too!)
I use property-based testing for the external API of my business domain service.
Yes, the test code is an "oracle", knowing what the service _should_ be doing. It's a parallel implementation of the logic, highly geared towards the use-case of testing.
Imho, property-based testing makes you think a little bit harder about the invariants in your code, which by itself is already a win. Then there's this: https://groups.google.com/g/leveldb/c/gnQEgMhxZAs
That's quite interesting. We've been porting some of these properties over to PBT for ML [1]. It really works and makes a big difference even in the AI space if we focus on semantically meaningful changes (which are a bit easier to replicate in the software world).
It might be just the projects I work on but I find that the vast majority of bugs I encounter are ones that property testing would not catch.
The two most recent (entirely typical) examples I'm thinking of are third-party systems behaving unexpectedly and a "bug in requirements".
I can see that where it might be effective would be in finding obscure edge cases in very complex logical code with simple inputs and outputs - e.g. a DSL parser or a sophisticated pricing engine. Weirdly it seems I do less and less of that kind of thing every year.
I keep being amazed just how good you are at finding relevant discussions. And thanks for posting them, as they usually have interesting discussions as well.